The riparian plot is the primary method to visualize syntenic relationships among >2 species. GENESPACE offers this functionality and considerable customization opportunities. Here, we will illustrate some of these possibilities.
Typically, the user will run the GENESPACE pipeline via
run_genespace(gsParam = x), then call the function
plot_riparian(gsParam = x) where x is the GENESPACE
parameter list. This method has the benefit of using the pre-checked
file paths stored in x and can thus calculate and access
the block coordinates without any user specifications. Since
plot_riparian() requires access to the ‘syntenic hits’ text
files, which can be very large, this tutorial does not include the
source data to repeat the exact analyses. Instead users will have to
construct the entire GENESPACE run from the scripts included below. The
full run takes about 20 minutes to complete.
library(GENESPACE)
## GENESPACE v1.1.9: synteny and orthology constrained comparative
## genomics
NOTE This section is copied from the primary GENESPACE guide. If you have already run that, no need to do this part.
Set the paths for your run. Change these paths to those valid on your system
genomeRepo <- "~/path/to/store/rawGenomes"
wd <- "~/path/to/genespace/workingDirectory"
path2mcscanx <- "~/path/to/MCScanX/"
Download the data of interest. Here we are going to use human, mouse, platypus, chicken, and sand lizard. To do this programatically, get the URLs to the genome annotations, but leave off the ‘genomic.gff.gz’ string at the end, then make the full paths by appending ‘translated_cds.faa.gz’ and ‘genomic.gff.gz’ to the ends of the paths. Finally make the directories to write to and download the files.
urls <- c(
human ="000/001/405/GCF_000001405.40_GRCh38.p14/GCF_000001405.40_GRCh38.p14_",
mouse = "000/001/635/GCF_000001635.27_GRCm39/GCF_000001635.27_GRCm39_",
platypus = "004/115/215/GCF_004115215.2_mOrnAna1.pri.v4/GCF_004115215.2_mOrnAna1.pri.v4_",
chicken = "016/699/485/GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b/GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b_",
sandLizard = "009/819/535/GCF_009819535.1_rLacAgi1.pri/GCF_009819535.1_rLacAgi1.pri_")
genomes2run <- names(urls)
urls <- file.path("https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF", urls)
translatedCDS <- sprintf("%stranslated_cds.faa.gz", urls)
geneGff <- sprintf("%sgenomic.gff.gz", urls)
names(translatedCDS) <- genomes2run
names(geneGff) <- genomes2run
writeDirs <- file.path(genomeRepo, genomes2run)
names(writeDirs) <- genomes2run
for(i in genomes2run){
print(i)
if(!dir.exists(writeDirs[i]))
dir.create(writeDirs[i])
download.file(
url = geneGff[i],
destfile = file.path(writeDirs[i], basename(geneGff[i])))
download.file(
url = translatedCDS[i],
destfile = file.path(writeDirs[i], basename(translatedCDS[i])))
}
Parse the annotations to fastas with headers that match a gene bed file
genomes2run <- c("human", "mouse", "platypus", "chicken", "sandLizard")
parsedPaths <- parse_annotations(
rawGenomeRepo = genomeRepo,
genomeDirs = genomes2run,
genomeIDs = genomes2run,
presets = "ncbi",
genespaceWd = wd)
Conduct the GENESPACE run … depending on your machine, this can take a few minutes up to an hour.
gpar <- init_genespace(
wd = wd,
ploidy = 1,
path2mcscanx = path2mcscanx)
out <- run_genespace(gpar, overwrite = T)
## Loading objects:
## gsParam
Here is what the default plot looks like using the human genome as the reference. Each syntenic block, phased and colored by the human reference genome chromosomes, are visualized as braids. Each chromosome is a rounded rectangle polygon.
ripDat <- plot_riparian(
gsParam = out,
refGenome = "human",
forceRecalcBlocks = FALSE)
The forceRecalcBlocks argument (default = TRUE) ignores
any existing reference-phased block coordiates and re-calculates them
according to the specifications in the plot_riparian
function call. In general, this is a good idea to ensure that your
parameters are being respected; however, it can be much faster to not
recalculate block coordinates, especially in large runs.
For a full list of all the parameters, see the end of this tutorial.
Instead of aggregated syntenic regions, we can plot just the fully
collinear syntenic blocks using useRegions = FALSE. We can
also use physical position instead of gene rank order with
useOrder = FALSE.
ripDat <- plot_riparian(
out,
refGenome = "human",
useOrder = FALSE,
useRegions = FALSE)
By default, the genomes are ordered following the positions in the
species tree from orthofinder. We can adjust this using the
genomeIDs parameter. We can also switch up which genome is
used to phase the blocks with refGenome.
ripDat <- plot_riparian(
gsParam = out,
refGenome = "mouse",
genomeIDs = c("mouse", "human", "platypus", "chicken"),
forceRecalcBlocks = FALSE)
By default, syntenyWeight = 0.5, which means that
chromsomes are re-ordered across the genomes to maximize synteny while
also preserving local ordering of chromosomes by the order of chromosome
names. For example, if both Chr1 and 2 are at similar syntenic
positions, but Chr2 is slightly more to the left than Chr1, Chr1 will
still come first in the synteny plot. If syntenyWeight = 1,
only synteny is considered and names are ignored. If
syntenyWeight = 0 (equivalent to
reorderBySynteny = FALSE), only chromosome names are used
and synteny is ignored.
ripDat <- plot_riparian(
gsParam = out,
#reorderBySynteny = FALSE,
syntenyWeight = 0,
refGenome = "human")
We can also change how the chromosomes are ordered using two
approaches. Either using a custom order for the reference genome with
customRefChrOrder.
ripDat <- plot_riparian(
gsParam = out,
refGenome = "human",
customRefChrOrder = c("X", 1:22))
Or alternatively, by supplying a function that re-orders the all the
chromosomes. By default, refChrOrdFun orders the reference
genome chromosomes so that first chromosomes with smaller numeric values
are on the left, then break the ties by chromosomes that come earlier in
the alphabet are to the left, and if still tied, choose a random one.
But we can mix this up. For example, any chromosomes that are not
numbered by an integer go first, then the integers:
ordFun <- function(x)
data.table::frank(
list(!grepl("[a-zA-Z]", x),
as.numeric(gsub("[a-zA-Z]", "", x))),
ties.method = "random")
ripDat <- plot_riparian(
gsParam = out,
refChrOrdFun = ordFun,
refGenome = "human",
reorderBySynteny = FALSE)
By default,
plot_riparian(minChrLen2plot = ifelse(useOrder, 100, 1e5))
excludes any scaffolds that have < 100 genes (is using gene rank
order) or < 100kb (if using physical position). We can plot all
chromosomes by setting this to 0, or only plot larger chromosomes by
increasing this value. For example, show even the tiny little
scaffolds:
ripDat <- plot_riparian(
gsParam = out,
minChrLen2plot = 0)